5 research outputs found
Learned Semantic Multi-Sensor Depth Map Fusion
Volumetric depth map fusion based on truncated signed distance functions has
become a standard method and is used in many 3D reconstruction pipelines. In
this paper, we generalize this classic method in several ways: 1)
Semantics: Semantic information enriches the scene representation and is
incorporated into the fusion process. 2) Multi-Sensor: Depth information can
originate from different sensors or algorithms with very different noise and
outlier statistics, which are taken into account during fusion. 3) Scene denoising
and completion: Sensors can fail to recover depth for certain materials and
lighting conditions, or data may be missing due to occlusions. Our method denoises the
geometry, closes holes and computes a watertight surface for every semantic
class. 4) Learning: We propose a neural network reconstruction method that
unifies all these properties within a single powerful framework. Our method
learns sensor or algorithm properties jointly with semantic depth fusion and
scene completion and can also be used as an expert system, e.g. to unify the
strengths of various photometric stereo algorithms. Our approach is the first
to unify all these properties. Experimental evaluations on both synthetic and
real data sets demonstrate clear improvements.
Comment: 11 pages, 7 figures, 2 tables; accepted for the 2nd Workshop on 3D Reconstruction in the Wild (3DRW2019), in conjunction with ICCV 2019.
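The classic TSDF fusion that this paper generalizes is a per-voxel weighted running average of truncated signed distances. A minimal NumPy sketch, with illustrative array names, truncation distance, and toy values (not the paper's learned, semantic, multi-sensor variant):

```python
import numpy as np

def fuse_tsdf(tsdf, weights, observation, obs_weight, trunc=0.1):
    """Classic TSDF fusion step: truncate the new signed distances,
    then blend them into the grid as a weighted running average."""
    obs = np.clip(observation, -trunc, trunc)            # truncate distances
    new_weights = weights + obs_weight
    new_tsdf = (weights * tsdf + obs_weight * obs) / np.maximum(new_weights, 1e-9)
    return new_tsdf, new_weights

# Toy example: fuse two noisy depth observations into a 4-voxel grid.
grid, w = np.zeros(4), np.zeros(4)
grid, w = fuse_tsdf(grid, w, np.array([0.05, -0.02, 0.2, -0.3]), np.ones(4))
grid, w = fuse_tsdf(grid, w, np.array([0.07, -0.04, 0.2, -0.3]), np.ones(4))
# Distances beyond the truncation band (0.2, -0.3) are clipped to +/-0.1
# before averaging, which is what keeps outliers from corrupting the surface.
```

The paper's contribution replaces the fixed per-sensor weights above with weights learned jointly with semantics and scene completion.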
Learned Multi-View Texture Super-Resolution
We present a super-resolution method capable of creating a high-resolution
texture map for a virtual 3D object from a set of lower-resolution images of
that object. Our architecture unifies the concepts of (i) multi-view
super-resolution based on the redundancy of overlapping views and (ii)
single-view super-resolution based on a learned prior of high-resolution (HR)
image structure. The principle of multi-view super-resolution is to invert the
image formation process and recover the latent HR texture from multiple
lower-resolution projections. We map that inverse problem into a block of
suitably designed neural network layers, and combine it with a standard
encoder-decoder network for learned single-image super-resolution. Wiring the
image formation model into the network avoids having to learn perspective
mapping from textures to images, and elegantly handles a varying number of
input views. Experiments demonstrate that the combination of multi-view
observations and the learned prior yields improved texture maps.
Comment: 11 pages, 5 figures; 2019 International Conference on 3D Vision (3DV).
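The inverse problem at the heart of multi-view super-resolution can be sketched in one dimension: each low-resolution view observes the latent HR texture through a known linear map (here a simple shifted averaging operator, an illustrative stand-in for the paper's warp-blur-downsample model), and stacking the views gives a linear system to invert:

```python
import numpy as np

def downsample_op(n_hr, factor, shift):
    """LR pixel = mean of `factor` consecutive HR pixels, offset by `shift`
    (a crude model of a sub-pixel-shifted low-resolution view)."""
    n_lr = (n_hr - shift) // factor
    A = np.zeros((n_lr, n_hr))
    for i in range(n_lr):
        A[i, shift + i * factor : shift + (i + 1) * factor] = 1.0 / factor
    return A

x_true = np.arange(1.0, 9.0)                        # latent HR texture (8 texels)
views = [downsample_op(8, 2, s) for s in (0, 1)]    # two shifted LR views
b = np.concatenate([A @ x_true for A in views])     # observed LR pixels
A = np.vstack(views)                                # stacked system: 7 eqs, 8 unknowns
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)       # minimum-norm LS solution
```

The stacked system here is underdetermined (7 equations for 8 unknowns), so least squares fits the observations exactly but cannot pin down the null-space component; this residual ambiguity is precisely what a learned single-image prior, as in the paper, is needed to resolve.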
Towards Large Scale Dense Semantic 3D Reconstruction
With the increasing digitisation of various industries requiring digital twins for virtual interactions or simulations, the need for methods that generate 3D models is more pressing than ever. In particular, 3D reconstruction from images can play a significant part in the solution. When performed together with semantic segmentation, it yields 3D models that carry information about the nature of the scene components, which is valuable for applications in mixed reality and robotics. Moreover, 3D reconstruction and semantics are mutually beneficial: shape priors enable reconstruction in areas poorly supported by data. One state-of-the-art method treats the task as a voxel-labelling problem, which can be solved using convex optimisation. Spectacular results are obtained, with high-quality geometry per semantic category, but at the cost of high memory consumption. Such methods are therefore poorly scalable, limiting the size and resolution of the scenes they can process as well as the number of semantic categories they can consider.
We present several contributions to tackle this challenge. The first is a novel data structure that divides the scene into voxel blocks, each of which stores only a subset of the semantic categories. However, this approach still requires hand-crafted shape priors, and designing them becomes intractable as more categories come into play. We therefore introduce a lightweight neural network based on variational inference, which learns these priors. We then extend it to an octree-based sparse volumetric scene representation, bridging the gap between the semantic and spatial scalability problems. Moreover, we propose a sensor-adaptive module, which allows our network to process data degraded by heterogeneous noise models.
We conclude with preliminary work on a new method that encodes the scene into a latent orthographic space, which can be built incrementally from aerial views and decoded into a semantic 3D scene. Throughout, experiments on indoor, outdoor, real and synthetic datasets with up to 40 categories show that our contributions achieve significant improvements in memory footprint while producing accurate reconstructions without requiring large amounts of training data.
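The thesis's first contribution, voxel blocks that each hold only the semantic categories actually present, can be sketched as a hash map from block coordinates to per-category grids. The block size, class names, and API below are illustrative choices, not the thesis code:

```python
import numpy as np

BLOCK = 8  # voxels per block edge (illustrative)

class SparseSemanticVolume:
    """Scene split into voxel blocks; each block stores a dense grid only
    for the semantic categories observed inside it, instead of a dense
    (X, Y, Z, n_categories) volume over the whole scene."""

    def __init__(self):
        self.blocks = {}  # (bx, by, bz) -> {category_id: BLOCK^3 float grid}

    def update(self, x, y, z, category, value):
        key = (x // BLOCK, y // BLOCK, z // BLOCK)
        cats = self.blocks.setdefault(key, {})
        grid = cats.setdefault(category, np.zeros((BLOCK,) * 3, np.float32))
        grid[x % BLOCK, y % BLOCK, z % BLOCK] = value

    def active_categories(self, bx, by, bz):
        return sorted(self.blocks.get((bx, by, bz), {}))

vol = SparseSemanticVolume()
vol.update(3, 3, 3, category=2, value=0.9)    # e.g. "building" voxel
vol.update(12, 3, 3, category=5, value=0.7)   # e.g. "vegetation" voxel
```

With, say, 40 categories of which only two or three occur per block, storage per block shrinks proportionally; the octree extension described above pushes the same idea further by also making the spatial subdivision itself sparse.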